GH-49438: [C++][Gandiva] Optimize LPAD/RPAD functions by dmitry-chirkov-dremio · Pull Request #49439 · apache/arrow

dmitry-chirkov-dremio · 2026-03-03T14:35:21Z

Rationale for this change

The lpad_utf8_int32_utf8 and rpad_utf8_int32_utf8 functions have performance inefficiencies and a potential memory safety issue:

Performance: Single-byte fills iterate character by character when a single memset would suffice. Multi-byte fills take O(n) copy iterations where a doubling strategy needs only O(log n).
Memory safety: When the fill string is longer than the padding space needed, the code could write more bytes than allocated. Fixed preventatively.
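
To make the O(n) versus O(log n) claim concrete, the following sketch counts copy operations under each strategy. The function names are hypothetical, and the linear model (one copy per repetition of the fill text) is a simplifying assumption; the original Gandiva loop is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: one copy per repetition of the fill text.
static int64_t LinearCopyCount(int64_t fill_len, int64_t total_bytes) {
  assert(fill_len > 0 && total_bytes >= 0);
  return (total_bytes + fill_len - 1) / fill_len;  // ceil(total_bytes / fill_len)
}

// Hypothetical model: one seed copy, then the filled prefix doubles each pass.
static int64_t DoublingCopyCount(int64_t fill_len, int64_t total_bytes) {
  assert(fill_len > 0 && total_bytes >= 0);
  if (total_bytes == 0) return 0;
  int64_t copies = 1;  // seed copy of the fill text, clamped to the space available
  int64_t filled = fill_len < total_bytes ? fill_len : total_bytes;
  while (filled < total_bytes) {
    const int64_t remaining = total_bytes - filled;
    filled += filled < remaining ? filled : remaining;  // copy at most the prefix
    ++copies;
  }
  return copies;
}
```

Under these assumptions, filling 196,308 bytes with a 3-byte fill (65,436 repetitions) takes 65,436 copies with the linear loop and 17 with doubling.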

What changes are included in this PR?

Memory safety fix: Use std::min(fill_text_len, total_fill_bytes) for the initial copy to prevent overflow
Fast path: Add single-byte fill optimization using memset
General path: Replace character-by-character loop with doubling strategy for multi-byte fills
Tests: Add comprehensive tests for the new code paths
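
A minimal sketch of how the clamped initial copy, the memset fast path, and the doubling loop could fit together, assuming a byte-oriented helper. `FillWithPattern` and its parameters are hypothetical names, not the helper added in this PR, and the caller is assumed to have computed `total_fill_bytes` so that it ends on a UTF-8 character boundary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: fills out[0, total_fill_bytes) with fill_text repeated,
// cutting the last repetition at the byte limit.
static void FillWithPattern(char* out, int32_t total_fill_bytes,
                            const char* fill_text, int32_t fill_text_len) {
  assert(out != nullptr && fill_text != nullptr);
  if (total_fill_bytes <= 0 || fill_text_len <= 0) return;

  // Fast path: a single-byte fill is one memset.
  if (fill_text_len == 1) {
    std::memset(out, fill_text[0], static_cast<std::size_t>(total_fill_bytes));
    return;
  }

  // Clamp the seed copy so a fill longer than the padding never overruns out.
  int32_t filled = std::min(fill_text_len, total_fill_bytes);
  std::memcpy(out, fill_text, static_cast<std::size_t>(filled));

  // Doubling: each pass copies the already-filled prefix onto the tail.
  // chunk <= filled, so source and destination never overlap.
  while (filled < total_fill_bytes) {
    const int32_t chunk = std::min(filled, total_fill_bytes - filled);
    std::memcpy(out + filled, out, static_cast<std::size_t>(chunk));
    filled += chunk;
  }
}
```

Because the filled prefix is always a whole number of repetitions until the final, clamped pass, copying it onto the tail preserves the pattern.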

Are these changes tested?

Yes. Added tests covering:

Large UTF-8 fill characters (4-byte emoji, 3-byte Chinese)
Single-byte fill boundaries (1 char and 65536 char padding)
Content verification for fill patterns
Doubling strategy boundaries
Partial fill scenarios (fill text longer than padding needed)
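
The multi-byte and partial-fill cases depend on cutting the fill text at a character boundary rather than at an arbitrary byte. As an illustration only, the hypothetical helper below (not part of the PR, and assuming valid UTF-8 input) computes how many bytes the first `num_chars` characters occupy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes occupied by the first num_chars UTF-8 characters
// of text. Returns text_len when text holds fewer characters than requested.
static int32_t Utf8PrefixBytes(const char* text, int32_t text_len,
                               int32_t num_chars) {
  assert(text != nullptr || text_len == 0);
  int32_t pos = 0;
  int32_t chars_seen = 0;
  while (pos < text_len && chars_seen < num_chars) {
    ++pos;  // consume the lead byte
    // Consume continuation bytes, which have the bit pattern 10xxxxxx.
    while (pos < text_len &&
           (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80) {
      ++pos;
    }
    ++chars_seen;
  }
  return pos;
}
```

For a fill made of two 3-byte characters, asking for one character yields 3 bytes, so a partial fill never splits a character.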

Are there any user-facing changes?

No.

GitHub Issue: [C++][Gandiva] Optimize LPAD/RPAD functions: fix memory safety issue and improve performance #49438

github-actions · 2026-03-03T14:35:48Z

⚠️ GitHub issue #49438 has no components, please add labels for components.

dmitry-chirkov-dremio · 2026-03-03T14:39:04Z

Local Benchmark Results

Platform: Apple M3, macOS
Benchmark: cpp/src/gandiva/tests/micro_benchmarks.cc, 10 repetitions, 1 million rows per test

The original RPAD was pathologically slow compared to LPAD because the two functions used different algorithms. With 65K padding characters, LPAD took ~29 ms while RPAD took ~992 ms (34x slower for an identical operation). The optimization applies the same efficient algorithm to both functions.

LPAD (mean time in μs)

Benchmark	Original	Optimized	Speedup
Minimal (9 padding chars)	147	148	-
Small (99 padding chars)	312	167	1.9x
Medium (100 padding chars)	368	219	1.7x
Large (1000 padding chars)	16,242	16,273	-
XLarge (65436 padding chars)	29,115	27,987	1.04x

RPAD (mean time in μs)

Benchmark	Original	Optimized	Speedup
Minimal (9 padding chars)	247	148	1.7x
Small (99 padding chars)	1,704	165	10x
Medium (100 padding chars)	1,773	216	8x
Large (1000 padding chars)	30,813	16,082	1.9x
XLarge (65436 padding chars)	992,334	27,724	36x


kou · 2026-03-04T00:10:35Z

@lriggs @akravchukdremio @xxlaykxx You may want to review this.

kou

It seems that lpad_utf8_int32_utf8() and rpad_utf8_int32_utf8() have a lot of duplicated code. Can we factor it out into a helper function?
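
One way such a refactoring could look is sketched below. This is not the helper from the PR: `PadImpl`, `PadSide`, `LPad`, and `RPad` are hypothetical names, the sketch uses `std::string` and byte lengths (so it is only correct for single-byte input), and the edge-case behavior (truncating long input, returning the text unchanged for an empty fill) is an assumption modeled on common SQL LPAD/RPAD semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class PadSide { kLeft, kRight };

// Hypothetical shared helper: both wrappers differ only in where the padding goes.
static std::string PadImpl(const std::string& text, std::size_t return_length,
                           const std::string& fill, PadSide side) {
  // Assumed behavior: truncate when the text is already long enough.
  if (return_length <= text.size()) return text.substr(0, return_length);
  // Assumed behavior: with nothing to pad with, return the text unchanged.
  if (fill.empty()) return text;

  const std::size_t pad_len = return_length - text.size();
  std::string padding;
  padding.reserve(pad_len);
  while (padding.size() < pad_len) {
    // append(str, pos, n) copies at most n bytes, so the last repetition is cut.
    padding.append(fill, 0, pad_len - padding.size());
  }
  assert(padding.size() == pad_len);
  return side == PadSide::kLeft ? padding + text : text + padding;
}

static std::string LPad(const std::string& text, std::size_t return_length,
                        const std::string& fill) {
  return PadImpl(text, return_length, fill, PadSide::kLeft);
}

static std::string RPad(const std::string& text, std::size_t return_length,
                        const std::string& fill) {
  return PadImpl(text, return_length, fill, PadSide::kRight);
}
```

With a single implementation, any optimization of the fill loop applies to both sides at once, which is what the LPAD/RPAD timing gap in the benchmarks suggests was missing.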

cpp/src/gandiva/precompiled/string_ops.cc

cpp/src/gandiva/precompiled/string_ops_test.cc

cpp/src/gandiva/precompiled/string_ops.cc

kou

+1

conbench-apache-arrow · 2026-03-13T11:22:14Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 5707713.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

GH-XXXXX: [C++][Gandiva] Optimize lpad/rpad UTF-8 functions

81c1bde

github-actions bot added the Component: C++, Component: Gandiva, and awaiting review labels Mar 3, 2026

dmitry-chirkov-dremio changed the title from "GH-49438: [C++][Gandiva] Optimize lpad/rpad UTF-8 functions" to "GH-49438: [C++][Gandiva] Optimize LPAD/RPAD functions" Mar 3, 2026

lriggs approved these changes Mar 6, 2026

github-actions bot added the awaiting committer review label and removed the awaiting review label Mar 6, 2026

kou reviewed Mar 11, 2026

cpp/src/gandiva/precompiled/string_ops.cc (outdated, resolved)

cpp/src/gandiva/precompiled/string_ops_test.cc (outdated, resolved)

cpp/src/gandiva/precompiled/string_ops_test.cc (resolved)

github-actions bot added the awaiting changes and awaiting review labels and removed the awaiting committer review label Mar 11, 2026

Address review feedback: extract helper function and simplify tests

85ae4f7

github-actions bot added the awaiting review and awaiting committer review labels and removed the awaiting review and awaiting changes labels Mar 11, 2026

kou reviewed Mar 11, 2026

cpp/src/gandiva/precompiled/string_ops.cc (outdated, resolved)

github-actions bot added the awaiting changes label and removed the awaiting committer review label Mar 11, 2026

Add static to internal helper function

8415ac5

github-actions bot added the awaiting change review label and removed the awaiting changes label Mar 13, 2026

kou approved these changes Mar 13, 2026

kou merged commit 5707713 into apache:main Mar 13, 2026
51 of 52 checks passed

kou removed the awaiting change review label Mar 13, 2026

kou mentioned this pull request Mar 13, 2026

[C++][Gandiva] Optimize LPAD/RPAD functions: fix memory safety issue and improve performance #49438

Closed

github-actions bot added the awaiting merge label Mar 13, 2026

kou merged 3 commits into apache:main from dmitry-chirkov-dremio:gandiva-lpad-optimization


3 participants
